(54) Transcription of speech data with segments from acoustically dissimilar environments



(57) A technique to improve the recognition accura- 
cy when transcribing speech data that contains data 
from a wide range of environments. Input data in many 
situations contains data from a variety of sources in dif- 
ferent environments. Such classes include: clean 
speech, speech corrupted by noise (e.g., music), non- 
speech (e.g., pure music with no speech), telephone 
speech, and the identity of a speaker. A technique is de-
scribed whereby the different classes of data are first
automatically identified, and then each class is tran-
scribed by a system that is made specifically for it. The
invention also describes a segmentation algorithm that 
is based on making up an acoustic model that charac- 
terizes the data in each class, and then using a dynamic 
programming algorithm (the viterbi algorithm) to auto- 
matically identify segments that belong to each class. 
The acoustic models are made in a certain feature 
space, and the invention also describes different feature 
spaces for use with different classes.

Description

The present invention relates to the transcription of 
data that includes speech in one or more environments 
and non-speech data. 

Speech recognition is an important aspect of fur- 
thering man-machine interaction. The end goal in devel- 
oping speech recognition systems is to replace the key- 
board interface to computers with voice input. To this 
end, several systems have been developed; however, 
these systems typically concentrate on improving the 
transcription error rate on relatively clean data in a con- 
trolled and steady-state environment: i.e., the speaker 
would speak relatively clearly in a quiet environment. 
Though this is not an impractical assumption for appli- 
cations such as transcribing dictation, there are several 
real-world situations where these assumptions are not
valid, i.e., the ambient conditions are noisy or change 
rapidly or both. As the end goal of research in speech 
recognition is the universal use of speech-recognition 
systems in real-world situations (e.g., information ki-
osks, transcription of broadcast shows, etc.), it is nec-
essary to develop speech-recognition systems that op- 
erate under these non-ideal conditions. For instance, in 
the case of broadcast shows, segments of speech from 
the anchor and the correspondents (which are either rel-
atively clean, or have music playing in the background)
are interspersed with music and interviews with people 
(possibly over a telephone, and possibly under noisy 
conditions). 

A speech recognition system designed to decode 
clean speech could be used to decode these different 
classes of data, but would result in a very high error rate 
when transcribing all data classes other than clean 
speech. For instance, if this system were used to de-
code a segment with pure music, it would produce a
string of words whereas there is in fact no speech in the 
input, leading to a high insertion error rate. One way to 
solve this problem is to use a "mumble-word" model in 
the speech-recognizer. This mumble-model is designed 
so that it matches the noise-like portion of the acoustic 
input, and hence can eliminate some of the insertion er- 
rors. However, the amount of performance improvement 
obtained by this technique is limited. 

In accordance with the present invention, there is 
now provided a method for transcribing a segment of 
data that includes speech in one or more environments 
and non-speech data, comprising: inputting the data to 
a segmenter and producing a series of segments, each 
segment being given a type-ID selected from a prede- 
termined set of classes; transcribing each type-ID'ed 
segment using a specific system created for that type. 

Viewing the present invention from another aspect, 
there is now provided a system for transcribing a seg- 
ment of data that includes speech in one or more envi- 
ronments and non-speech data, comprising: means for 
inputting the data to a segmenter and producing a series 
of segments, each segment being given a type-ID se-
lected from a predetermined set of classes; means for
transcribing each type-ID'ed segment using a specific 
system created for that type. 

In a preferred embodiment of the present invention
there is provided an alternative way of dealing with the
problem, where the first step is to automatically identify
each of the distinct classes in the input, and then to use
a different speech-recognizer to transcribe each of the
classes of the input data. This significantly improves the
performance on all data classes.

In a preferred embodiment of the present invention,
each data class is transcribed by a system that is made
up specifically for it. These systems are made by transforming
the training data on which the speech-recognition
system is trained, so that it matches the acoustic
environment of the class. For instance, the main characteristic
of telephone speech is that it is band-limited
to 300-3700 Hz, whereas the clean training data has
a higher bandwidth. So in order to transform the training
data to better match the telephone-quality speech, the
training data is band-limited to 300-3700 Hz, and the
speech-recognition system is trained using this transformed
data. Similarly, for the case of music-corrupted
speech, pure music is added to the clean training data
and the speech-recognition system is trained on the
music-corrupted training data.

Preferred embodiments of the present invention
provide a technique to improve the recognition accuracy
when transcribing speech data that contains data from
a wide range of environments, as for example broadcast
news shows. This segmentation procedure does not require
a script for the speech data and is hence unsupervised.
Broadcast news shows typically contain data
from a variety of sources in different environments. A
broad (and the most important) categorization of speech
data yields the following classes: clean speech, speech
corrupted by noise (for broadcast news this could be
music playing in the background), pure noise (or music)
with no speech, interviews conducted over the telephone,
and finally speech in a non-stationary noisy environment.
Typically, speech recognition systems are
trained on clean speech; however, if such a system is used
to decode the data from the other classes, it results in
very poor recognition performance.
In a preferred embodiment of the present invention,
different classes of data are first automatically identified,
and then each class is transcribed by a system that is
made specifically for it. In a preferred embodiment of the
present invention, there is provided a segmentation algorithm
that is based on making up an acoustic model
that characterizes the data in each class, and then using
a dynamic programming algorithm (the Viterbi algorithm)
to automatically identify segments that belong to each
class. The acoustic models are made in a certain feature
space, and the invention also describes different feature
spaces for use with different classes.

Preferred embodiments of the present invention will 
now be described, by way of example only, with refer-
ence to the accompanying drawings, in which:

FIG. 1 is a linear segment classification tree for a
preferred embodiment of the present invention; 

FIG. 2 is a Hidden Markov Model classifier for a pre- 
ferred embodiment of the present invention; 

FIG. 3 is a parallel segment classification tree for a 
preferred embodiment of the present invention; 

FIGS. 4 and 4a are Hidden Markov Models for clas- 
sifying segmented speech as corresponding to one 
of a plurality of known speakers, in a preferred em- 
bodiment of the present invention; 

FIG. 5 is a system block diagram of a preferred em- 
bodiment of the present invention; 

FIG. 6 is a flow diagram of an embodiment of the 
present invention; 

FIG. 7(a) is a flow diagram illustrating the addition 
of codebooks in a preferred embodiment of the 
present invention; and, 

FIG. 7(b) is a flow diagram describing a segmenta- 
tion procedure in a preferred embodiment of the 
present invention. 

We describe a technique to improve the recognition
accuracy when transcribing speech data that contains 
data from a wide range of environments, such as broad- 
cast news shows. The specific classes considered here 
pertain to the broadcast news show, but similar tech- 
niques can be used to deal with other situations that in- 
clude speech from several environments and non- 
speech sounds. Broadcast news shows typically con- 
tain data from a variety of sources in different environ- 
ments, and a broad categorization of these types is as 
follows: clean speech, speech corrupted by music play- 
ing in the background, pure music with no speech, in- 
terviews conducted over the telephone, and finally 
speech in a non-stationary noisy environment. It is pos-
sible to further differentiate subcategories in the above
categories, for instance, the clean speech class could 
be further categorized based on the speaker identity, or 
alternate types of microphones etc. However, the above 
described categories are the most important in the 
sense that the above categorization is sufficient to pro- 
vide a significant performance improvement. 

In a preferred embodiment of the present invention,
the input speech is converted to a stream of feature vectors
in time, $s_1, \ldots, s_T$, and the objective now is to assign
each feature vector to one of the classes mentioned earlier.
Further, as successive feature vectors are extracted
from the input at very short time intervals (10 ms or so),
it is likely that a contiguous segment of feature vectors
would be assigned the same tag. In the implementation
described here, it is also possible to enforce this constraint,
i.e., that the length of a contiguous set of feature
vectors that are assigned the same tag is larger
than a certain minimum length. It is assumed that the
training data is labelled, i.e., every one of the stream of
feature vectors representing the training data has been
tagged with a class id, and further, that there is sufficient
data for each class to enable a model to be made for
the class.

One of the problems here is to find a feature vector
based on which different classes can be easily differentiated,
and this feature vector may be different from that
used for the speech recognition process. It may also be
the case that different feature vectors are needed to isolate
different classes. For such circumstances, it is easier
to organize the segmentation in the hierarchy of a
tree as shown in Fig. 1, where data from a single class
is identified at each level of the tree. This also allows for
the possibility of using different feature vectors at different
levels in the tree. For instance, in Fig. 1, the feature
vector $f_1$ is used to identify segments that belong to class
$c_1$, $f_2$ is used to identify segments belonging to class $c_2$,
and so on.

The manner in which the segmentation is carried
out at each level of the tree is as follows: at each level
of the tree, a binary decision is made for every feature
vector in the input stream, i.e., whether the feature vector
belongs to a specific class or not. In order to do this,
the first step is to generate models that represent the
distribution of the feature vector for each class. This can
be done by taking the feature vectors in the training data
that have tags corresponding to the class, and clustering
them to form a model $M_c$ comprising a mixture of Gaussian
distributions $G_{c,i}$, $i = 1, \ldots, k_1$, where $\mu_{c,i}$ and
$\Lambda_{c,i}$ represent the means and covariance matrices of the
Gaussians. The same procedure can be used to generate a
model, $M_{\bar{c}}$, for data that does not belong to the class.
Let $G_{\bar{c},i}$, $i = 1, \ldots, k_2$, represent the Gaussians in this
mixture, and $\mu_{\bar{c},i}$ and $\Lambda_{\bar{c},i}$ represent the means and
covariance matrices of these Gaussians.
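
The model-building step can be illustrated with a minimal sketch. It assumes NumPy, uses k-means as the clustering step and diagonal covariances in place of full covariance matrices, and runs on synthetic two-dimensional data; the function names `fit_mixture` and `log_likelihood` and all numeric settings are hypothetical and are not taken from the text.

```python
import numpy as np

def fit_mixture(features, k, iters=20, seed=0, var_floor=1e-3):
    """Cluster tagged training vectors with k-means, then fit one
    diagonal-covariance Gaussian per cluster (an assumed simplification)."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    means = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((features[:, None, :] - means[None]) ** 2).sum(axis=2).argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                means[i] = features[labels == i].mean(axis=0)
    labels = ((features[:, None, :] - means[None]) ** 2).sum(axis=2).argmin(axis=1)
    # Fall back to the global variance for clusters with too few members.
    variances = np.tile(features.var(axis=0) + var_floor, (k, 1))
    counts = np.zeros(k)
    for i in range(k):
        members = features[labels == i]
        counts[i] = len(members)
        if len(members) > 1:
            variances[i] = members.var(axis=0) + var_floor
    weights = (counts + 1e-6) / (counts + 1e-6).sum()
    return weights, means, variances

def log_likelihood(x, model):
    """Log of the mixture density for each row of x."""
    weights, means, variances = model
    x = np.atleast_2d(np.asarray(x, dtype=float))
    diff = x[:, None, :] - means[None]
    log_gauss = -0.5 * (diff ** 2 / variances[None]
                        + np.log(2 * np.pi * variances[None])).sum(axis=2)
    scores = np.log(weights)[None] + log_gauss
    peak = scores.max(axis=1)
    return peak + np.log(np.exp(scores - peak[:, None]).sum(axis=1))

# Synthetic stand-ins for vectors tagged "in class" and "not in class".
rng = np.random.default_rng(1)
in_class = rng.normal(0.0, 1.0, size=(200, 2))
out_class = rng.normal(5.0, 1.0, size=(200, 2))
model_c = fit_mixture(in_class, k=2)
model_not_c = fit_mixture(out_class, k=2)
frame = np.array([0.2, -0.1])
in_class_wins = log_likelihood(frame, model_c)[0] > log_likelihood(frame, model_not_c)[0]
```

The two log-likelihoods computed at the end are the per-frame quantities that a class versus non-class decision would compare.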

The underlying assumption now is that the input
feature vector was produced by one of these models,
and the task is to assign it the tag of the model that gives
it the highest probability. The probability of the input feature
vector $x_t$ belonging to the specified class is given
by $p(x_t \mid M_c)$ and the probability of its not belonging to the
class is given by $p(x_t \mid M_{\bar{c}})$. Further, we would also like
to impose the minimum length constraint mentioned
earlier, i.e., that the number of contiguous feature vectors
that are assigned the same tag has to be at least
a specified minimum. This can be done by assuming
a hidden Markov model (HMM) for the generation of
the input data as shown in Fig. 2. The upper path in the
model corresponds to the input data belonging to the
specified class, and the probability distribution of the
arcs $c_1$-$c_N$ is given by $M_c$. The lower path corresponds
to the input data not belonging to the specified class,
and the probability distribution of the arcs $\bar{c}_1$-$\bar{c}_M$ is given
by $M_{\bar{c}}$. The minimum lengths of the two paths, N and
M, constrain the number of contiguous feature vectors
that are assigned either to the class or not to the class
to be at least N and M, respectively. The Viterbi algorithm
is used to trace a path through the trellis corresponding
to the model in Fig. 2, and to assign a class-ID
$c$ or $\bar{c}$ to contiguous sets of the input feature vectors.
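
The minimum-length constraint can be sketched as a Viterbi search over two left-to-right chains of states. This is an illustrative reading of the HMM described above, not a reproduction of Fig. 2: the per-frame log-likelihoods are assumed to be given, only the last state of each chain may loop or switch to the other chain, and the decoded path is required to end in a last state. The name `segment_min_duration` and the `switch_penalty` parameter are hypothetical.

```python
def segment_min_duration(ll_class, ll_other, n_min, m_min, switch_penalty=0.0):
    """Tag each frame as in-class (True) or not (False), forcing runs of at
    least n_min in-class frames and m_min out-of-class frames."""
    n_frames = len(ll_class)
    n_states = n_min + m_min
    neg_inf = float("-inf")
    first = (0, n_min)                  # entry states of the two paths
    last = (n_min - 1, n_states - 1)    # exit states of the two paths

    def emit(state, t):
        return ll_class[t] if state < n_min else ll_other[t]

    # Allowed predecessors of every state, with transition log-weights.
    preds = {s: [] for s in range(n_states)}
    for s in range(n_states):
        if s not in first:
            preds[s].append((s - 1, 0.0))        # advance along a path
    for s in last:
        preds[s].append((s, 0.0))                # stay in the final state
    preds[first[0]].append((last[1], -switch_penalty))   # not-class -> class
    preds[first[1]].append((last[0], -switch_penalty))   # class -> not-class

    score = [neg_inf] * n_states
    for s in first:
        score[s] = emit(s, 0)
    back = []
    for t in range(1, n_frames):
        new_score = [neg_inf] * n_states
        pointers = [-1] * n_states
        for s in range(n_states):
            best, arg = neg_inf, -1
            for p, weight in preds[s]:
                if score[p] + weight > best:
                    best, arg = score[p] + weight, p
            if arg >= 0:
                new_score[s] = best + emit(s, t)
            pointers[s] = arg
        back.append(pointers)
        score = new_score

    end = max(last, key=lambda s: score[s])
    if score[end] == neg_inf:
        raise ValueError("input is shorter than the minimum segment length")
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [state < n_min for state in path]

# Frame 2 looks out-of-class in isolation, but a run of one frame is not allowed.
ll_class = [0, 0, -5, 0, 0, -5, -5, -5, -5, -5]
ll_other = [-5, -5, 0, -5, -5, 0, 0, 0, 0, 0]
tags = segment_min_duration(ll_class, ll_other, n_min=3, m_min=3)
```

In this toy input the isolated outlier at frame 2 is absorbed into the surrounding in-class run, which is the smoothing effect the minimum-length constraint is meant to provide.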

To deal with broadcast news, the first level of the
tree in Fig. 1 was used to isolate pure music segments
from the remainder of the data. Note that so far, we have
not specified the feature vector $f_1$ that is to be used in
the procedure described above. There are several features
that characterize music, and any one of these, or
a combination, could be used to form the feature vector
$f_1$ of Fig. 1.

A window of the input speech is taken every 10 ms
(referred to as a frame) and a feature vector is produced
for every frame. Usually the first step in the process of
extracting a feature vector is the computation of the energy
or the log of the energy in logarithmically spaced
frequency bands. One of the characteristics of music is
that it tends to have approximately equal energy in each
band. Hence one feature that could be used to characterize
music is the variance of the energy across the different
bands in a frame. Another feature that tends to
distinguish music from speech is the behavior of the
pitch of the signal. The pitch for a speech signal tends
to show a large variation about the mean value, whereas
the pitch for a music signal tends to be relatively constant
in time. Hence, another feature that could be used
to distinguish speech from music is the mean and variance
of the pitch over time. A third possibility for a feature
is the cepstra, or a linear combination of the cepstra
for several frames. (The cepstra are obtained by applying
the Discrete Fourier Transform to a vector whose
elements comprise the energies in the log-spaced frequency
bands.) Yet another possibility is to start with a
combination of the above features and then compute a
linear discriminant (see R.O. Duda and P.E. Hart, Pattern
Classification and Scene Analysis, Wiley, N.Y.,
1973) to separate out music and speech (as there are
only two classes that need to be distinguished, only one
discriminant vector can be found that separates the two
classes maximally).
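
Two of the features named above can be sketched briefly. The sketch assumes NumPy and synthetic data: `band_energy_variance` computes the variance of the band energies within each frame, and `fisher_discriminant` computes the single two-class linear discriminant direction in the standard Fisher form (within-class scatter inverse times the difference of class means). The function names, the ridge term and the toy data are hypothetical.

```python
import numpy as np

def band_energy_variance(band_energies):
    """Variance across frequency bands, one value per frame.
    band_energies has shape (frames, bands)."""
    return np.var(np.asarray(band_energies, dtype=float), axis=1)

def fisher_discriminant(class_a, class_b, ridge=1e-6):
    """Unit vector w maximizing two-class separation: w ~ Sw^-1 (mean_a - mean_b)."""
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    scatter = (np.cov(a, rowvar=False) * (len(a) - 1)
               + np.cov(b, rowvar=False) * (len(b) - 1))
    # The small ridge keeps the scatter matrix invertible (an assumption).
    scatter = np.atleast_2d(scatter) + ridge * np.eye(a.shape[1])
    w = np.linalg.solve(scatter, a.mean(axis=0) - b.mean(axis=0))
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Toy log band energies: nearly flat for "music", uneven for "speech".
music_bands = 10.0 + 0.1 * rng.normal(size=(100, 8))
speech_bands = 10.0 + 3.0 * rng.normal(size=(100, 8))
music_var = band_energy_variance(music_bands).mean()
speech_var = band_energy_variance(speech_bands).mean()

# Toy two-dimensional combined features for the discriminant.
music_feats = rng.normal([2.0, 1.0], 1.0, size=(100, 2))
speech_feats = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
w = fisher_discriminant(music_feats, speech_feats)
music_proj = (music_feats @ w).mean()
speech_proj = (speech_feats @ w).mean()
```

Projecting each frame onto `w` yields a scalar feature on which the two class means are separated, which could then feed the class and non-class models.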

After the pure music segments have been removed,
the next level in Fig. 1 separates telephone speech from
regular speech. As mentioned earlier, telephone speech
is characterized by having a bandwidth of 300-3700 Hz,
whereas regular clean speech has a much larger bandwidth
(e.g., 0-8000 Hz). Hence, the ratio of energy in the
300-3700 Hz frequency band to the energy outside this
band could be used as a feature that would help to separate
telephone and regular speech. Alternate or additional
possibilities include the cepstra and linear discriminants,
as mentioned above for the case of pure music.
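
The band-energy ratio can be sketched as follows, assuming NumPy, a 16 kHz sampling rate and a 30 ms frame; the function name `telephone_band_ratio`, the `eps` guard against division by zero and the test tones are hypothetical choices made for illustration.

```python
import numpy as np

def telephone_band_ratio(frame, sample_rate, low_hz=300.0, high_hz=3700.0, eps=1e-12):
    """Energy inside the telephone band divided by the energy outside it."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    return float(power[in_band].sum() / (power[~in_band].sum() + eps))

sample_rate = 16000
t = np.arange(480) / sample_rate                      # one 30 ms frame
narrowband = np.sin(2 * np.pi * 1000 * t)             # energy only inside the band
wideband = narrowband + 2.0 * np.sin(2 * np.pi * 6000 * t)
narrow_ratio = telephone_band_ratio(narrowband, sample_rate)
wide_ratio = telephone_band_ratio(wideband, sample_rate)
```

A frame whose energy lies almost entirely inside 300-3700 Hz gives a very large ratio, while a frame with substantial energy above 3700 Hz gives a small one.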

The third level in the tree of Fig. 1 isolates clean 
speech from speech in a noisy background. One of the
features that characterizes speech is a large variation
in its energy across time, whereas noisy speech shows 
much less of a variation. Hence, a possible feature for 
separating clean and noisy speech is the variance of the 
energy across time. As in earlier cases, alternate or ad- 
ditional possibilities include the cepstra and linear dis- 
criminants. 
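
A minimal sketch of the energy-variance feature follows. It assumes NumPy, 10 ms non-overlapping frames at 16 kHz, and uses the log of the frame energy rather than the raw energy so that the feature does not depend on overall gain; that choice, the function names and the synthetic signals are assumptions.

```python
import numpy as np

def frame_log_energy(signal, frame_len):
    """Log energy of consecutive non-overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log((frames ** 2).sum(axis=1) + 1e-12)

def energy_variance(signal, frame_len):
    """Variance of the frame log energy across time."""
    return float(np.var(frame_log_energy(signal, frame_len)))

rng = np.random.default_rng(0)
noise = rng.normal(size=16000)                       # steady background, 1 s
envelope = np.repeat([1.0, 0.05] * 5, 1600)          # loud/quiet every 100 ms
bursty = noise * envelope                            # crude stand-in for clean speech
steady_var = energy_variance(noise, frame_len=160)
bursty_var = energy_variance(bursty, frame_len=160)
```

The signal with alternating loud and quiet stretches shows a much larger variance than the steady one, which is the contrast the feature relies on.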

Once the clean speech segments have been iden- 
tified, it is possible to try and classify them further by the 
identity of the speaker, or the type of microphone being 
used, etc. The segmentation procedure used here is 
identical to that used in the earlier hierarchy of the tree, 
with the difference that the split at this stage is no longer 
binary. For instance, if it is desired to categorize the 
clean speech as one of speakers A, B, C or an unknown 
speaker, then the hierarchy of the tree would be as 
shown in Fig. 3, and the underlying HMM would be as 
shown in Fig. 4. Hence, the arcs labelled $a_1$-$a_m$ have a
probability distribution that characterizes speaker A, $b_1$-$b_n$
have a probability distribution that characterizes
speaker B, and so on, and $d_1$-$d_p$ have a probability distribution
that characterizes the unknown speaker, which is
represented by a speaker-independent model.

If a transcription is available for the clean speech 
that is to be segmented, then it is possible to use it to 
further improve the segmentation. In this case, the mod- 
el for each speaker is made up by combining a number 
of sub-models where each sub-model characterizes the 
pronunciation of a phonetic class by the speaker. For 
instance, if there are K phonetic classes, each speaker 
model $M_s$ is composed of a mixture of sub-models (an
example of a sub-model is a Gaussian), each sub-mod-
el corresponding to a phonetic class.

The input speech is first Viterbi aligned (see A.J. Vi-
terbi, "Error Bounds for Convolutional Codes and an As-
ymptotically Optimum Decoding Algorithm", IEEE Trans.
on Info. Theory, vol. IT-13, pp. 260-69, April 1967)
against the given script using a speaker independent 
model in order to assign a phonetic tag to each feature 
vector and to isolate regions of silence. The task of the 
segmenter now is to assign a speaker id tag to the 
speech segments between two silence regions, given 
the phonetic tag of every feature vector in the segment. 

The acoustic model for assigning a speaker id tag
to the segment between two consecutive silence regions
is shown in FIG. 4a. There, it is assumed that the
length of the segment between the two silences is M,
and the phonetic tags assigned to the feature vectors
are $j_1, \ldots, j_M$. The likelihood of a feature vector is computed
given the sub-model of the speakers corresponding
to the phonetic class that was assigned to the feature
vector; for instance, for the tag $j_1$ of the first feature vector
in the segment, $a_{j_1}, b_{j_1}, \ldots, u_{j_1}$
represent the sub-models of the various speakers for
the phonetic class $j_1$.
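
The scoring of one segment against the speaker sub-models can be sketched as below. For brevity the sketch uses scalar features and one Gaussian per (speaker, phonetic class) pair; the speaker names, phone labels and model parameters are invented for illustration, and `identify_speaker` is a hypothetical helper.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def identify_speaker(frames, phone_tags, speaker_models):
    """Sum, per speaker, the log-likelihood of every frame under that
    speaker's sub-model for the frame's phonetic tag; return the best speaker."""
    totals = {}
    for speaker, sub_models in speaker_models.items():
        totals[speaker] = sum(
            gaussian_log_pdf(x, *sub_models[tag])
            for x, tag in zip(frames, phone_tags)
        )
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical sub-models: (mean, variance) per phonetic class and speaker.
speaker_models = {
    "A": {"aa": (1.0, 0.5), "s": (4.0, 0.5)},
    "B": {"aa": (2.0, 0.5), "s": (6.0, 0.5)},
}
frames = [1.1, 0.9, 4.2]            # features between two silence regions
phone_tags = ["aa", "aa", "s"]      # tags from the Viterbi alignment
best_speaker, scores = identify_speaker(frames, phone_tags, speaker_models)
```

Conditioning each frame on its phonetic tag means a speaker is compared with other speakers on the same sound, rather than on an average over all sounds.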

Once the input speech has been segmented into
the desired classes, the next step is to transcribe each
segment with a speech recognizer that was made specifically
for it. These systems are made by transforming
the training data on which the speech-recognition system
is trained, so that it matches the acoustic environment
of the class. For instance, the main characteristic
of telephone speech is that it is band-limited to
300-3700 Hz, whereas the clean training data has a
higher bandwidth. So in order to transform the training
data to better match the telephone-quality speech, the
training data is band-limited to 300-3700 Hz, and the
speech-recognition system is trained using this transformed
data. Similarly, for the case of music-corrupted
speech, pure music is added to the clean training data
and the speech-recognition system is trained on the
music-corrupted training data.
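
The two training-data transformations can be sketched as follows, assuming NumPy. Band-limiting is done here by zeroing FFT bins outside 300-3700 Hz, and music is mixed in at a chosen signal-to-noise ratio; the text does not specify a filter design or a mixing level, so `band_limit`, `add_music` and the `snr_db` parameter are assumptions, and the signals are synthetic.

```python
import numpy as np

def band_limit(signal, sample_rate, low_hz=300.0, high_hz=3700.0):
    """Remove spectral content outside the telephone band (ideal FFT mask)."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

def add_music(speech, music, snr_db):
    """Add music to clean speech, scaled to the requested speech-to-music ratio."""
    speech = np.asarray(speech, dtype=float)
    music = np.resize(np.asarray(music, dtype=float), len(speech))  # loop or trim
    gain = np.sqrt(np.mean(speech ** 2) / (np.mean(music ** 2) * 10 ** (snr_db / 10)))
    return speech + gain * music

sample_rate = 16000
t = np.arange(1600) / sample_rate
low_tone = np.sin(2 * np.pi * 1000 * t)
clean = low_tone + np.sin(2 * np.pi * 6000 * t)       # stand-in for wideband speech
telephone_like = band_limit(clean, sample_rate)

rng = np.random.default_rng(0)
music = rng.normal(size=700)                          # stand-in for a music clip
mixed = add_music(clean, music, snr_db=10.0)
achieved_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((mixed - clean) ** 2))
```

A recognizer for each class would then be trained on the transformed copies of the clean training set rather than on the originals.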

FIG. 5 is a block diagram of a system in accordance
with the present invention. The invention can be implemented
on a general purpose computer programmed to
provide the functional blocks shown in FIG. 5. In FIG. 5,
the input data, which can include clean speech, music-corrupted
speech, pure music and/or telephone speech,
is input to a segmenter 500. The segmenter segments
the input data into segments classified as one of the
foregoing types of input data. Each segment is then directed
to a decoder corresponding to its type: telephone
speech segments to telephone speech decoder
502; clean speech to clean speech decoder 506; and
music-corrupted speech to music-corrupted speech decoder
508. Each of decoders 502, 506 and 508 produces
a decoded output of the input segment, and these
outputs are presented as the decoded output of the system.
In the case of pure music, no decoded output is in
fact produced, since no speech data is contained therein
for decoding.

FIG. 6 shows a flow diagram of an embodiment of
the present invention. Again, these steps can be implemented
on a general purpose computer programmed in
accordance with the details set forth herein. The first
step of the process as shown in FIG. 6 is to provide input
data that may contain pure music, music-corrupted
speech, telephone speech and/or clean speech. This input
is segmented in step 602 to produce a series of segments
representing the input data, each segment being
given an ID selected from a predetermined set of IDs:
pure music, music-corrupted speech, telephone speech
and clean speech. The segments are then decoded
and transcribed in step 604 using a particular decoding
means tailored to the particular type of data. The result
of this operation is presented as the decoded output.

A preferred embodiment of the present invention allows
for the segmentation of speech utterances on the
basis of the speaker. In other words, under stationary
conditions of noise and channel, the method can segment
input speech according to the identity of the speakers.
It is not necessary that the speakers be known in
advance. The number of speakers may also be unknown.
If the channel or the background noise changes
(e.g., music appears), the method can also segment accordingly.
Thus, this aspect of the invention allows unsupervised
segmentation of speech utterances on the
basis of the speaker, channel and background. It does
not require any a-priori information about the signal, but
can easily use any such information. It is a fast method
which does not require any actual speech recognition
phase in order to make a decision. This aspect of the
invention also regroups the different segments which
present the same characteristics (i.e., same speaker
through same channel and with similar background), so
that systems can be trained on line to the specificity of
each class of segments.

Figure 7(a) illustrates a method corresponding to
this embodiment of the present invention that allows the
addition of new codebooks to an existing system. The
admissibility check 702 selects feature vectors that are
not confusable with existing codebooks. The feature
vectors are then vector quantized at step 704, and the
resulting codebooks passed to blocks 706 and 708,
where the means and variances are computed, respectively.
The means and variances are then stored at block
710. An account of all feature vectors that are outside
the means and variances of a particular codebook is
maintained in scores block 712. If the number of such vectors
exceeds a predetermined threshold, the feature
vector is discarded.

Figure 7(b) describes the procedure
for testing. Each frame is decoded at step 802 with the
set of codebooks currently existing, using input from
block 804. A set of histograms is then generated at block
806, and these histograms are used to select a codebook
at step 810. The histograms represent the number
of feature vectors in the test speech that match up with
each codebook. The potential codebook is checked for
consistency (ratios of distances to codewords to the variances
of these codewords) at step 808. If the selection
is acceptable, the frames are tagged by a codebook index.
It is optional to use the data to update the associated
frames. In general, for stability it is advisable not to use
more than 10 seconds worth of data to define a codebook.
If the consistency test fails, a new codebook is
added to the set using admissible frames at step 812.

Note that the method not only segments the files 
when there is a change of speaker, but it can also re- 
group segments associated with a particular speaker (or 
under similar conditions). 

Because no information is known about the speak- 
ers, there is no training phase required in this aspect of 
the invention. However, because some thresholds are 
required, these thresholds are established prior to test- 
ing, using similar data. The threshold selection can be 
done by trial and error. 

Of course, if any a-priori information is known, it can 
be incorporated in the codebook. For example, if some 
speakers are known, their models can be loaded in ad- 
vance and only new speakers will be subsequently add- 
ed to the database. If music is suspected in the back- 
ground, codebooks can be built with pure speech, pure 
music and different types of music plus speech. 

The above segmentation is done in an unsupervised
way and independently of any speech recognition.
In a preferred configuration the feature vectors (obtained
as output of the acoustic front-end) are the MEL
cepstra, delta and delta-delta (including C0 (energy)).
They are 39-dimensional vectors. The features are usually
computed on frames of 30 ms with shifts of 10 ms. Note
that it is purely a function of the speech recognizer for 
which model prefetching is implemented. If other feature 
vectors (like LPC cepstra) are used, the feature vector 
used by the segmenter can be LPC cepstra. However, 
the efficiency of the segmenter depends on the set of 
features used. Of course, it is always possible to use 
different feature sets for the recognizer and the seg- 
menter. 

The segmenter system stores a minimum of infor- 
mation about each type of segment: for example, a 
codebook of 65 codewords (although the number of 
codewords is not critical), their variances and some op- 
tional scores obtained during enrolment. In particular, it 
is not necessary to store the training utterances. 

Once a codebook is finally selected, the system can 
tag the associated frames accordingly. 

Segmentation is performed using a generalized 
vector quantization algorithm. 

When a new codebook is added to the database,
the feature vectors selected as admissible are clustered
in a set of, for example, 65 codewords. The variance of
each cluster is also stored. Optionally, some additional
scores are stored for each codeword, including the
number of feature vectors associated with this codeword
while being far apart from it. Two distances can be
used: a Mahalanobis distance, which is a Euclidean distance
with weights that are the inverse of the variances
of each dimension of the feature vector (these weights
can be decided a-priori based on the data used to train
the speech recognition application, or on the fly on the
training and/or testing data), or a probabilistic distance
where the distance is the log-likelihood of the Gaussian
associated with the codeword (same means and variances).
Typically, 10 seconds of speech are used for
training. Feature vectors are considered admissible if
they are not confusable with existing codewords. This
means that some vectors too close to existing codewords
are rejected during clustering. If this is not done,
instabilities will occur due to overlapping of clusters.
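
The two distances and the admissibility check can be sketched as below, assuming NumPy and per-dimension variances. The weighted distance is returned in squared form, and the rejection threshold `min_distance` is a hypothetical tuning parameter of the kind the text says is set by trial and error; the function names are likewise illustrative.

```python
import numpy as np

def weighted_distance(x, codeword, variance):
    """Squared Euclidean distance with inverse-variance weights per dimension."""
    return float(np.sum((x - codeword) ** 2 / variance))

def log_likelihood_score(x, codeword, variance):
    """Log-likelihood of x under the Gaussian attached to the codeword."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * variance)
                               + (x - codeword) ** 2 / variance))

def select_admissible(vectors, codewords, variances, min_distance):
    """Keep only vectors that are not too close to any existing codeword."""
    keep = []
    for x in vectors:
        nearest = min(weighted_distance(x, c, v)
                      for c, v in zip(codewords, variances))
        if nearest >= min_distance:
            keep.append(x)
    return np.array(keep)

existing_codewords = np.array([[0.0, 0.0]])
existing_variances = np.array([[1.0, 1.0]])
candidates = np.array([[0.1, 0.0], [5.0, 5.0]])
admissible = select_admissible(candidates, existing_codewords,
                               existing_variances, min_distance=4.0)
```

Only the admissible vectors would then be clustered into the codewords of the new codebook, which keeps the new clusters from overlapping the existing ones.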

During testing, or actual use of the system, the feature
vectors are obtained from the acoustic front-end.
After about three seconds of speech, a candidate tag
will begin to emerge. After about 5 to 10 seconds, a final
decision is made. The testing is implemented as a generalized
VQ decoder. On a frame by frame basis, it identifies
the closest codebook (or ranks the N closest). A
histogram is then created which counts how many
frames have selected each codebook. The codebook
which was most often selected identifies the potential
tag.
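
The frame-by-frame vote can be sketched as follows, assuming NumPy and the inverse-variance weighted distance; each codebook is represented as a hypothetical pair of arrays (codewords, variances), and the tiny two-codeword codebooks are toy data.

```python
import numpy as np

def closest_codebook(frame, codebooks):
    """Index of the codebook containing the codeword nearest to the frame."""
    distances = [
        np.min(np.sum((frame - codewords) ** 2 / variances, axis=1))
        for codewords, variances in codebooks
    ]
    return int(np.argmin(distances))

def vote(frames, codebooks):
    """Histogram of per-frame choices and the most frequently chosen codebook."""
    histogram = np.zeros(len(codebooks), dtype=int)
    for frame in frames:
        histogram[closest_codebook(frame, codebooks)] += 1
    return int(np.argmax(histogram)), histogram

codebooks = [
    (np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    (np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2))),
]
frames = np.array([[5.1, 4.9], [6.2, 5.8], [0.1, 0.2], [5.5, 5.5]])
candidate_tag, histogram = vote(frames, codebooks)
```

Accumulating the histogram over a few seconds of frames makes the candidate tag robust to individual frames that happen to fall nearer a competing codebook.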

At this stage, the segmenter checks if the potential
tag is a consistent choice. If it is not, a new codebook is
added using admissible frames. The consistency is
checked based on different tests. Firstly, the histogram
is inspected. A clear maximum indicates a good chance
that the choice is correct. If a set of competing codebooks
emerges, tests on the variances are more critical.
The variance tests are defined as follows: for each
feature vector, its distance to the selected codeword or
the competing codewords is compared to their associated
variances. If the distances are too large given the
associated scores, the codebook is rejected. If no codebook
is eventually accepted, none is identified and a
new codebook must be built. If one codebook remains
acceptable, then it identifies the tag.
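
One possible form of the variance test is sketched below, assuming NumPy. The text does not give the exact criterion, so the rule used here (reject the codebook when too large a fraction of frames lies further from its nearest codeword than a multiple of the variance) and both thresholds, `max_ratio` and `max_outlier_fraction`, are assumptions chosen for illustration.

```python
import numpy as np

def is_consistent(frames, codewords, variances, max_ratio=9.0, max_outlier_fraction=0.2):
    """Accept the candidate codebook unless too many frames are far from it.

    A frame counts as an outlier when its variance-normalized squared distance
    to the nearest codeword, averaged over dimensions, exceeds max_ratio."""
    frames = np.asarray(frames, dtype=float)
    normalized = ((frames[:, None, :] - codewords[None]) ** 2
                  / variances[None]).sum(axis=2)          # shape (frames, codewords)
    nearest = normalized.min(axis=1) / frames.shape[1]
    outlier_fraction = np.mean(nearest > max_ratio)
    return bool(outlier_fraction <= max_outlier_fraction)

codewords = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
matching_frames = [[0.1, -0.2], [3.9, 4.2], [0.3, 0.1]]
foreign_frames = [[10.0, 10.0], [12.0, 9.0], [0.0, 0.0]]
accepted = is_consistent(matching_frames, codewords, variances)
rejected = not is_consistent(foreign_frames, codewords, variances)
```

When the check fails for every candidate, the frames would be passed to the codebook-building step instead of being tagged.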

While the invention has been described in particular
with respect to specific embodiments thereof, it will be
understood that modifications to these embodiments can
be effected without departing from the scope of the
invention.



Claims

1. A method for transcribing a segment of data that 
includes speech in one or more environments and 
non-speech data, comprising: 

inputting the data to a segmenter and produc- 
ing a series of segments, each segment being 
given a type-ID selected from a predetermined 
set of classes; 

transcribing each type-ID'ed segment using a 
specific system created for that type. 

2. The method of claim 1, wherein the step of seg- 
menting comprises: 

identifying a number of classes that the 
acoustic input can be classified into that represent 
the most acoustically dissimilar classes possible. 

3. The method of claim 2, wherein the classes include 
non-speech, telephone speech, noise-corrupted 
speech, and clean speech. 

4. The method of claim 2, wherein the step of assign- 
ing a type-ID comprises: 

assuming that the input data is produced by a 
parallel combination of models, each model 
corresponding to one of the predetermined 
classes; 

the class ID assigned to a segment being the 
class ID of the model that gives the segment 
the highest probability, subject to certain con- 
straints. 

5. The method of claim 4, wherein one of the con-
straints is a minimum duration on the segment.

6. The method of claim 2, wherein the step of assign-
ing a type-ID comprises: 

using a binary tree hierarchy, wherein at each 
level of the tree segments corresponding to one of 
the predetermined classifications are isolated. 

7. The method of claim 1, wherein the step of segmen-
tation is carried out using a Hidden Markov Model
to model each class and the Viterbi algorithm to iso-
late and assign type-IDs to the segments.

8. The method of claim 4, wherein the process of cre- 
ating the models comprises identifying a feature 
space for the individual predetermined classes. 

9. The method of claim 8, wherein the feature space 
for the model for non-speech is created by any one 
of: 

taking a window of the input speech every 10 
milliseconds and computing a vector compris- 
ing the energy or log energy in logarithmically 
spaced frequency bands on that window, the 
feature being the variance across the dimen- 
sions of the vector; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing the cepstra from this vector, the fea- 
ture being the cepstra; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing a linear discriminant to separate out 
non-speech and speech; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing the variance across the dimensions 
of the vector, the cepstra of the vector and a linear 
discriminant, wherein the feature is the variance 
across the dimensions of the vector, the 
cepstra of the vector or a linear discriminant; 
and, 

taking a window of the input speech every 10 
milliseconds and computing the pitch, wherein 
the feature is the mean and the variance of the 
pitch across a plurality of consecutive windows. 
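
By way of illustration only, the first two feature computations of claim 9 (log energies in logarithmically spaced bands, their variance across the vector, and cepstra derived from them) can be sketched for a single window as follows. The number of bands, the lowest band edge `f_min`, the use of a DCT-II to obtain the cepstra, and all function names are assumptions of this sketch; the claim does not fix them.

```python
import cmath
import math

def power_spectrum(frame):
    """One-sided power spectrum via a direct DFT (adequate for short windows)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]

def log_band_energies(frame, sample_rate, n_bands=8, f_min=100.0):
    """Log energy in n_bands logarithmically spaced bands (band layout assumed)."""
    spectrum = power_spectrum(frame)
    n = len(frame)
    f_max = sample_rate / 2.0
    edges = [f_min * (f_max / f_min) ** (b / n_bands) for b in range(n_bands + 1)]
    energies = []
    for lo, hi in zip(edges, edges[1:]):
        e = sum(p for k, p in enumerate(spectrum) if lo <= k * sample_rate / n < hi)
        energies.append(math.log(e + 1e-10))  # small floor avoids log(0)
    return energies

def variance_across_bands(log_e):
    """Variance across the dimensions of the band-energy vector."""
    mean = sum(log_e) / len(log_e)
    return sum((v - mean) ** 2 for v in log_e) / len(log_e)

def cepstra(log_e, n_coeffs):
    """Cepstral coefficients as an unnormalised DCT-II of the log band energies."""
    n = len(log_e)
    return [
        sum(v * math.cos(math.pi * q * (b + 0.5) / n) for b, v in enumerate(log_e))
        for q in range(n_coeffs)
    ]
```

In use, a window would be taken every 10 milliseconds and one of these functions applied to each window to obtain the per-frame feature supplied to the class models.
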

10. The method of claim 8, wherein the feature space 
for the model for telephone speech is created by 
any one of: 

taking a window of the input speech every 10 
milliseconds and computing a ratio of the energies 
in the telephone frequency band 
(300-3700 Hz) to the total energy of the signal; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing the cepstra from this vector, the fea- 
ture being the cepstra; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing a linear discriminant to separate telephone 
speech and non-telephone speech. 
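
By way of illustration only, the energy-ratio feature of claim 10 can be sketched for a single window as follows. The function name `telephone_band_ratio`, the direct DFT used to obtain the spectrum, and the treatment of an all-zero window (ratio reported as 0.0) are assumptions of this sketch.

```python
import cmath
import math

def telephone_band_ratio(frame, sample_rate, band=(300.0, 3700.0)):
    """Ratio of energy inside the telephone band to the total energy of one window."""
    n = len(frame)
    in_band = 0.0
    total = 0.0
    for k in range(n // 2 + 1):
        # DFT coefficient for bin k, whose centre frequency is k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power = abs(coeff) ** 2
        total += power
        if band[0] <= k * sample_rate / n <= band[1]:
            in_band += power
    # Assumption: a silent window is reported as ratio 0.0 rather than raising.
    return in_band / total if total > 0.0 else 0.0
```

A ratio close to one indicates that nearly all of the energy in the window lies within 300-3700 Hz, as would be expected of band-limited telephone speech, whereas wideband material carrying energy outside that band yields a lower ratio.
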

11. The method of claim 8, wherein the feature space 
for the model for clean speech is created by any one 
of: 

taking a window of the input speech every 10 
milliseconds, and computing the energy in the 
window, wherein the feature is related to the 
variation of energy across a plurality of consecutive 
windows; 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing the cepstra from this vector, the feature 
being the cepstra; and, 

taking a window of the input speech every 10 
milliseconds, computing the log of the energy 
in logarithmically spaced frequency bands, and 
computing a linear discriminant to separate out 
clean speech and noisy speech. 

12. The method of claim 3, wherein clean speech segments 
are further segmented into smaller segments 
that can be assigned a speaker ID tag. 

13. The method of claim 1, further comprising creating 
a system for transcribing data from each class ID. 

14. The method of claim 12, further comprising provid- 
ing a script to allow supervised speaker identifica- 
tion and thereby improve the speaker ID segmen- 
tation. 

15. The method of claim 14, wherein the models for the 
training speakers are generated by combining sub- 
models that correspond to each phonetic or sub- 
phonetic class. 

16. The method of claim 14, wherein the clean 
speech is first Viterbi aligned against the given script, using 
speaker-independent models, to identify regions 
of silence and to tag every feature vector between 
two consecutive silence regions with the ID of a 
phonetic (or sub-phonetic) class. 

17. The method of claim 16, wherein a speaker ID is 
assigned to a speech segment between two con- 
secutive silences, where the likelihood of each fea- 
ture vector is computed given each speaker model 
for the sub-phonetic class that was assigned to that 
feature vector. 
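
By way of illustration only, the speaker assignment of claim 17 can be sketched as follows. The claim does not state the form of the speaker models; this sketch assumes one diagonal-covariance Gaussian per speaker and per sub-phonetic class, and the names `assign_speaker`, `speaker_models` and `class_tags` are hypothetical.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal Gaussian (assumed model form)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def assign_speaker(vectors, class_tags, speaker_models):
    """Pick the speaker whose class-specific models best explain a segment.

    vectors:        feature vectors lying between two consecutive silences
    class_tags:     sub-phonetic class ID assigned to each vector by the alignment
    speaker_models: {speaker: {class_id: (mean, var)}}  (hypothetical layout)
    """
    totals = {}
    for speaker, models in speaker_models.items():
        # Each vector is scored only by the model of the class it was tagged with.
        totals[speaker] = sum(
            gaussian_loglik(x, *models[tag]) for x, tag in zip(vectors, class_tags)
        )
    return max(totals, key=totals.get), totals
```

Scoring each vector against the model of its own sub-phonetic class, rather than against a single model per speaker, means the comparison between speakers is made on matching sounds.
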

18. The method of claim 3, wherein setting up a system 
for transcribing telephone speech comprises: 

transforming the training data from which the 
speech recognition system was made so that it 
matches the acoustic environment of telephone 
speech; 

wherein the transformation comprises band 
limiting the training data to telephone bandwidths. 

19. The method of claim 3, wherein setting up a system 
for transcribing music-corrupted speech comprises: 

transforming the training data from which the 
speech recognition system was made so that it 
matches the acoustic environment of music-corrupted 
speech; 

wherein the transformation comprises adding 
pure music to the clean speech in the training 
data. 

20. The method of claim 12, wherein the procedure for 
segmenting is carried out using a parallel technique 
using a word transcription for the clean speech. 

21. The method of claim 1, wherein the classes include 
the identity of a speaker. 

22. The method of claim 1, wherein one of the classes 
in the predetermined set of classes is a speaker 
identification class. 

23. The method of claim 22, wherein the speaker identification 
classes are not known a priori and are determined 
automatically based on updating classes 
corresponding to the speakers. 

24. The method of claim 22, wherein the speaker identification 
classes further comprise varying background 
environments, wherein speaker identification 
classes are determined in light of those varying 
environments. 

25. A system for transcribing a segment of data that includes 
speech in one or more environments and 
non-speech data, comprising: 

means for inputting the data to a segmenter and 
producing a series of segments, each segment 
being given a type-ID selected from a predetermined 
set of classes; 

means for transcribing each type-ID'ed segment 
using a specific system created for that 
type. 